
Mekhanik evgenii/1196 #17

Merged
merged 9 commits into from
Jun 27, 2024

Conversation

EvgeniiMekhanik
Contributor

No description provided.

@EvgeniiMekhanik EvgeniiMekhanik marked this pull request as draft July 31, 2023 10:10
@EvgeniiMekhanik EvgeniiMekhanik force-pushed the MekhanikEvgenii/1196 branch 2 times, most recently from 46d5070 to 45138c4 Compare August 8, 2023 15:47
@EvgeniiMekhanik EvgeniiMekhanik marked this pull request as ready for review August 10, 2023 14:02
@EvgeniiMekhanik EvgeniiMekhanik force-pushed the MekhanikEvgenii/1196 branch 4 times, most recently from 9660c99 to 042a7d9 Compare November 23, 2023 17:54
@EvgeniiMekhanik EvgeniiMekhanik force-pushed the MekhanikEvgenii/1196 branch 4 times, most recently from 67f5e65 to fe13b02 Compare December 25, 2023 07:23
if (tcp_send_head(sk)) {
	struct tcp_sock *tp = tcp_sk(sk);

#ifdef CONFIG_SECURITY_TEMPESTA
	if (mss_now != 0)
		__tcp_push_pending_frames(sk, mss_now, tp->nonagle);
Contributor

Can we initialize mss_now using tcp_current_mss() and get rid of the mss_now != 0 check?

Contributor Author

I thought about it, but we don't need to calculate mss_now at all if SOCK_TEMPESTA_HAS_DATA is not set and tcp_send_head is empty. I don't want to do extra work in that case.

@const-t
Contributor

const-t commented Jan 31, 2024

tso_fragment() is not used by Tempesta; we must move it back to static scope.

@EvgeniiMekhanik EvgeniiMekhanik force-pushed the MekhanikEvgenii/1196 branch from 8d02a09 to 13850d4 Compare May 1, 2024 04:06
@EvgeniiMekhanik EvgeniiMekhanik force-pushed the MekhanikEvgenii/1196 branch 2 times, most recently from b198d70 to 9472c4f Compare May 10, 2024 09:44
Contributor

@krizhanovsky krizhanovsky left a comment

LGTM

@EvgeniiMekhanik EvgeniiMekhanik force-pushed the MekhanikEvgenii/1196 branch 2 times, most recently from 437c1ef to 6bc466c Compare May 27, 2024 16:53
@EvgeniiMekhanik EvgeniiMekhanik force-pushed the MekhanikEvgenii/1196 branch 4 times, most recently from fdab4d7 to d1aaa66 Compare June 4, 2024 12:14
Contributor

@krizhanovsky krizhanovsky left a comment

LGTM

struct tcp_sock *tp = tcp_sk(sk);
unsigned int in_flight = tcp_packets_in_flight(tp);
unsigned int send_win, cong_win;
unsigned int limit;
int result;

if (!sk->sk_write_xmit || !skb_tfw_tls_type(skb))
Contributor

Suggested change:
-	if (!sk->sk_write_xmit || !skb_tfw_tls_type(skb))
+	if (unlikely(!sk->sk_write_xmit || !skb_tfw_tls_type(skb)))

Contributor

@const-t const-t left a comment

From my point of view, this patch has some points to improve.

Let's start with a discussion of the goals. As I understand it, our goal is to keep as little data as possible in the write_queue. For instance, if the client is able to receive 1024 bytes, we must place an skb of 1024 bytes into the write_queue and send it in one shot, avoiding any segmentation. In other words, tso_fragment must not be called at all.

However, this patch works a little differently. The first problem is described in the comment marked (2). The second problem is that ss_calc_snd_wnd() calculates the window in a wrong way: it completely ignores the TCP segmentation limits (see tcp_tso_segs()) and computes the cwnd quota without taking the minimum of half-cwnd and the remaining window (see tcp_cwnd_test()). The second problem leads to segmentation when the interface supports only a small number of TSO segments, or when the congestion control module limits the number of TSO segs: we then place into the queue a big skb whose size is bigger than max_segs.

Just an assumption: maybe we could cache the limit from ss_calc_snd_wnd() and reuse it in tcp_write_xmit, but that's only a theory.

struct tcp_sock *tp = tcp_sk(sk);
unsigned int in_flight = tcp_packets_in_flight(tp);
unsigned int send_win, cong_win;
unsigned int limit;
Contributor

(1) Why do we recalculate the limit here? We already have the limit calculated in tcp_write_xmit; it is calculated correctly and we can use it.

Contributor Author

Because we want to make one TLS record from several skbs. For example, if the MTU is 1500, our previously calculated limit is also about 1500; but if there are several skbs in the socket write queue, we can encrypt them all into a single TLS record, instead of calling tls_encrypt several times.

__func__, mss_now);
break;
}
TFW_ADJUST_TLS_OVERHEAD(limit);
Contributor

(2) This looks suboptimal. We calculate a limit, then TFW_ADJUST_TLS_OVERHEAD reduces the limit by the TLS overhead; the limit becomes lower than skb->len, so tso_fragment will split this skb, which is unnecessary overhead. Afterwards, in tfw_tls_encrypt, we will encrypt both skbs, but the last one will hang in the write_queue anyway.

Contributor Author

@EvgeniiMekhanik EvgeniiMekhanik Jun 13, 2024

Yes; because we encrypt several skbs, we don't need to adjust for the overhead.

struct {
	__u8 present : 1;
	__u8 tls_type : 7;
	__u16 flags : 16;

We still need flags to notify when an skb is transmitted, e.g. tempesta-tech/tempesta#2108.
Of course, on_tcp_entail could do this as well, but it is not currently used for this purpose, at least not for all control frames.

Contributor Author

We should implement an additional on_send_ callback in Tempesta for #2108, not use flags in the skb. Moreover, we removed the additional callback from the kernel code, so #2108 should be reworked.

@kingluo kingluo Jun 13, 2024

No, it can't be in the on_send callback: that only indicates entry into the socket's write queue, but the counter decrement must happen at the time of actual transmission, because only at that point is the skb handed to the TCP transmission path, i.e. tcp_write_xmit or its callers, e.g. __tcp_push_pending_frames, rather than accumulated in kernel memory for some reason, such as congestion or a zero TCP window from a malicious peer. This is exactly the purpose of the counter: to avoid control-frame flooding.

That is, for #2108 the solution is either to keep the counter decrement in sk_write_xmit, in which case the tfw_cb.flag is still needed, or to manipulate the counter directly in ss_skb_on_tcp_entail. No additional callbacks are needed here.

Contributor Author

ss_skb_on_tcp_entail is called before pushing the skb to the socket write queue. It is a good place to decrement the counter.

Contributor Author

Applications that work in user space and count control frames don't know anything about __tcp_push_pending_frames, and there is no problem for them.

Contributor Author

I think we can postpone such frames if really necessary (and send them only when we have enough window) and count them in ss_skb_on_tcp_entail. We are moving tls_encrypt into Tempesta code, so there is no sense in having one more callback in the Linux kernel only to count control frames. User-space applications like nginx don't know anything about tcp_transmit_skb; how would they count control frames?

@kingluo kingluo Jun 13, 2024

> User-space applications like nginx don't know anything about tcp_transmit_skb; how would they count control frames?

@EvgeniiMekhanik @krizhanovsky @const-t

I have to update some thoughts.

No: a user program relies on the socket buffer to determine whether the skb send is in flight. If the sockbuf is full, the send() system call fails because the peer does not provide enough window (or, on a non-blocking fd, the write event is shelved); the counter then accumulates, and the user program knows it is under a flood attack.

(Moreover, a user program can use the TCP_INFO getsockopt to retrieve TCP information and learn the in-flight bytes. In fact, nginx uses it to define some variables, although not exactly for our purpose.)

However, in our case it's a completely different story. Tempesta does not use the sockbuf but enqueues directly to the socket's sk_write_queue, so it has memory risks unless we know that the skb has started to be transmitted, i.e. is in flight. How do we know this? The only place is where tcp_transmit_skb successfully handles the skb, which is why I suggest applying the counting logic there. Neither on_send (called in ss_tx_action), nor entail (corresponding to the original prepare_xmit), nor even the write_xmit callback can meet our need to mitigate control-frame flood attacks. This is not a theory; it has been verified through testing.

And the method suggested by @EvgeniiMekhanik does not work either: similar to entail, pushing skbs when the window is available does not mean that all of them can be sent immediately, because we do not know how large the network capacity is or how many skbs can actually be sent, so some skbs will likely be held in memory for an undetermined time.

Contributor Author

We call ss_skb_on_tcp_entail just before we push the skb to the socket write queue. For HEADERS and DATA frames we calculate the TCP window in sk_fill_write_queue and don't call ss_skb_tcp_entail->ss_skb_on_tcp_entail if there is not enough window. To solve the problem with control frames, we should likewise not call ss_skb_tcp_entail->ss_skb_on_tcp_entail for all other frame types when there is not enough TCP window. Instead we should push them to a queue of postponed frames and move them to the socket write queue only when we have enough TCP window. In that case we can count control frames in ss_skb_on_tcp_entail.

@kingluo kingluo Jun 17, 2024

Fair enough. This is sufficient if we do not care about strict accuracy regarding whether the skb was successfully sent by tcp_transmit_skb or was even removed from tcp_rtx_queue.

Contributor

@krizhanovsky krizhanovsky Jun 25, 2024

I agree that the flags were replaced with the on_send callback in #1973, so it's not desirable to keep the flags, especially with only one user. Meantime, I also don't like the idea of additional queueing, since it wastes memory, decreases performance, and opens opportunities for DoS attacks.

However, in the first place I didn't get what we fix with #2108, so I requested changes. Maybe I'm wrong; let's discuss the fix, but I proposed controlling and limiting flows to the client in tempesta-tech/tempesta#498 (comment).

P.S. @kingluo @EvgeniiMekhanik, thank you for inviting me to the discussion.

unsigned int limit,
unsigned int skbs);
unsigned int limit);
int (*sk_fill_write_queue)(struct sock *sk,

I think we should document this breaking change, explaining why sk_prepare_xmit is deprecated and why the new sk_fill_write_queue requires somewhat complicated limit calculations. If not, perhaps even the author himself will forget about it after a few weeks. It's better to document it in the code as a comment, because tempesta-tech/tempesta#1196 only discusses a theoretical design (which is even outdated after the code rewrite), not the actual implementation.

Contributor

Yeah, it could make sense to document both callbacks: when they are called and what they do.

@kingluo kingluo Jun 17, 2024

Personally, I don't quite understand why we moved framing from prepare_xmit to tcp_push_pending_frames, which is the biggest change in this PR. In prepare_xmit we already know that TCP has enough window to send data, but in tcp_push_pending_frames we have to calculate it ourselves, which seems like duplicated work. Could you clarify this further in comments?

Contributor Author

Yes, I will add a comment before merging the PR. There were several problems with the previous approach:

  1. We push the response skb directly to the socket write queue. If the skb belongs to a stream that is blocked because the HTTP/2 stream window is exceeded, we can't send this skb to the client, and it blocks all other skbs in the socket write queue until a window update comes.
  2. We can't prioritize streams. We push responses to the socket write queue directly, so there is no prioritization. Now we push response skbs to our private scheduler queue, then call tcp_push_pending_frames, where we run our scheduler to choose the highest-priority stream and send data according to the TCP window. tcp_push_pending_frames is also called from tcp_data_snd_check when we receive data from the peer and the TCP window opens again.

@EvgeniiMekhanik EvgeniiMekhanik force-pushed the MekhanikEvgenii/1196 branch 6 times, most recently from 18581a9 to 9092882 Compare June 24, 2024 13:38
There were several problems with the previous approach:
 - We push the response skb directly to the socket write queue.
   If the skb belongs to a stream that is blocked because the
   HTTP/2 stream window is exceeded, we can't send this skb to
   the client, and it blocks all other skbs in the socket write
   queue until a window update comes.
 - We can't prioritize streams. We push responses to the socket
   write queue directly, so there is no prioritization.

Because of the framing rework in Tempesta FW,
make some changes in the Linux kernel:
- Remove the unused `sk_prepare_xmit` callback.
- Implement the new `sk_fill_write_queue` callback,
  which is used to move skbs from the Tempesta private
  write queue to the socket write queue.
- Add the new `SOCK_TEMPESTA_HAS_DATA` socket flag, which
  indicates that there is some data in the Tempesta FW
  write queue.

Now we push response skbs to our private scheduler queue.
Then we call `tcp_push_pending_frames`, where we run our
scheduler to choose the highest-priority stream and send
data according to the TCP window. `tcp_push_pending_frames`
is also called from `tcp_data_snd_check` when we receive
data from the peer and the TCP window opens again.
Since we no longer use skb private data in the xmit
callbacks, we can use `skb->cb` to save skb private
data for our own purposes.
- Codestyle fixes; move `tso_fragment` back to
  static scope.
- Close the socket immediately when the new
  `sk_fill_write_queue` socket callback returns an
  error.
- `sk_fill_write_queue` returns a positive value
  if there is unsent data in our scheduler.
Use the minimum of the TCP sender window,
the TCP receiver window, and the TLS max payload
size to calculate the size of a TLS record,
instead of using the MSS for this purpose.
Pass the TCP_NAGLE_OFF | TCP_NAGLE_PUSH flags when
we call `__tcp_push_pending_frames` in kernel
code, as we do when we call it from Tempesta FW
source code.
Export `tcp_cwnd_restart` to be able to
call `tcp_slow_start_after_idle_check` from
Tempesta FW code.
We should check `seg_size` in `tcp_write_wakeup`
only when it is used as an skb->len limit.
In Tempesta FW code we need to implement
different behaviour depending on where
`sk_fill_write_queue` is called from. (We need to
send `tcp_shutdown` if this function is called
from `ss_shutdown`, and not send it if it is
called from any other place.)
There are a lot of places in our source code
which are not yet clear.
@EvgeniiMekhanik EvgeniiMekhanik merged commit 09f6e2a into main Jun 27, 2024
@EvgeniiMekhanik EvgeniiMekhanik deleted the MekhanikEvgenii/1196 branch June 27, 2024 13:52